Intro to NGS processing

James A. Fellows Yates

2021-08-17

Why study genetics?

DNA carries the instructions for the functioning, growth, and reproduction of organisms, both living and dead

Sources of aDNA. Hofman et al. (2015) Trends in Ecology & Evolution. DOI: 10.1016/j.tree.2015.06.008

It carries much information about the past: ancestry and adaptation to different environments (e.g. diet, disease)

Who am I?

  • Education
    • B.Sc. Bioarchaeology (University of York, UK)
    • M.Sc. Naturwissenschaftliche Archäologie (University of Tübingen, DE)
    • Ph.D. Archaeogenetics (MPI-SHH / MPI-EVA, DE)
  • Experience
    • Number of genetics classes taken: 0
    • Number of bioinformatics classes taken: 0

i.e., YOU CAN LEARN TOO!

@jfy133: Currently funded by:

Icons designed by OpenMoji. License: CC BY-SA 4.0

Today we will

  1. Describe basics of DNA
  2. Introduce what DNA sequencing is
  3. Explain how Illumina NGS sequencing data is generated
  4. Preprocess Illumina NGS data [Practical]
  5. Evaluate Illumina NGS data [Practical]

Introduction to DNA

What is DNA?

Structure ADN Pradana Aumars, CC BY-SA 4.0, via Wikimedia Commons

The rules

  • Four nucleotides
    • Pyrimidines: Cytosine, Thymine
    • Purines: Guanine, Adenine
  • Base pairing: one pyrimidine with one purine
    • C with G (think: CGI)
    • A with T (think: AT-AT walker)
  • Complementary
    • C on one strand, G on the other (or v.v.)
    • A on one strand, T on the other (or v.v.)
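
The pairing rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names are made up for this example, and the reversal step reflects the fact that the two strands run in opposite directions, a detail the list above does not cover.

```python
# Base-pairing rules from the list above: C pairs with G, A pairs with T.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the partner base on the other strand for every position."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Return the other strand, read in its own direction (i.e. reversed)."""
    return complement(strand)[::-1]


print(complement("ACTG"))          # TGAC
print(reverse_complement("ACTG"))  # CAGT
```

Taking the reverse complement twice gives back the original strand, which is a handy sanity check.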

AT-AT Walker by Nick Bluth from the Noun Project, CC BY 3.0

How do we get DNA?

Figure 17 01 02 CNX OpenStax, CC BY 4.0, via Wikimedia Commons

What about ancient DNA?

  • Basically the same, except that aDNA molecules are degraded:
    • Fragmented (short molecules)
    • Damaged (modified nucleotides)
    • Contaminated (aDNA in a soup of modern DNA)

Sequencing ancient DNA © 2015 Lucy Reading / The Scientist. All rights reserved. Used here for training purposes only.

© 2015 Lucy Reading / The Scientist. All rights reserved. Modified and used here for training purposes only.

Introduction to DNA Sequencing

What is Sequencing?

Converting the chemical nucleotides of a DNA molecule to ACTG on your computer screen

What is NGS?

  • Historically: Sanger sequencing
    • Slow, expensive, resource hungry
  • “Next Generation Sequencing”
    • Sequence billions of DNA molecules at once!
    • Fast and cheap!
    • Market leader: Illumina (others: PacBio, IonTorrent)

Strictly speaking, this is ‘second’ generation sequencing (for the ‘third’ generation, see: Nanopore)

Illumina HiSeq 2500 Konrad Förstner, CC0, via Wikimedia Commons

How does it work?

Replicate a strand, but incorporate complementary fluorophore-labelled nucleotides, with one colour per base

Structures of fluorescent nucleotides Ju et al. (2006) PNAS DOI: 10.1073/pnas.0609513103

In Illumina: A G T C

Fire mah lazer, and take a picture! Rinse and repeat!

Where does this happen?

On a ‘flow cell’: glass slide with synthetic DNA ‘lawn’

Next generation sequencing slide. Bronner et al. (2013) Current Protocols in Human Genetics. DOI: 10.1002/0471142905.hg1802s79

Where does this happen?

But how do you get your DNA to attach to the lawn (and not get lost)?

  • Convert it to a library:
    • Add adapters: bind to the ‘lawn’ of the flow cell
    • Add indexes: sample-specific barcode
    • Add priming sites: where enzymes start copying DNA

AATGATACGGCGACCACCACaccgacaaCCCTACACGACGCTCTTCCGATCTXXXXXXAGCACACGTCTGAACTCCAGTCACgacactaCCGTCTTCTGCTTG
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTACTATGCCGCTGGTGGTGtggctgttGGGATGTGCTGCGAGAAGGCTAGAXXXXXXTCGTGTGCAGACTTGAGGTCAGTGctgtgatGGCAGAAGACGAAC

[Adapter & Index Primer] [Index] [Target primer] [Target] [Target primer] [Index] [Adapter & Index Primer]
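
Because each sample gets its own index, reads from many samples can be sequenced together and sorted afterwards. The sketch below shows the idea with exact matching only. The sample names, index sequences and function names are hypothetical and chosen for illustration; real demultiplexing tools usually also tolerate sequencing errors in the index.

```python
from collections import Counter

# Hypothetical sample sheet: index (barcode) sequence -> sample name.
SAMPLE_SHEET = {
    "ACGTAC": "sample_A",
    "TGCATG": "sample_B",
}


def assign_sample(index_read, sample_sheet=SAMPLE_SHEET):
    """Return the sample whose index matches exactly, else 'undetermined'."""
    return sample_sheet.get(index_read.upper(), "undetermined")


def count_reads_per_sample(index_reads, sample_sheet=SAMPLE_SHEET):
    """Tally how many reads were assigned to each sample."""
    return dict(Counter(assign_sample(i, sample_sheet) for i in index_reads))


index_reads = ["ACGTAC", "TGCATG", "acgtac", "NNNNNN"]
print(count_reads_per_sample(index_reads))
# {'sample_A': 2, 'sample_B': 1, 'undetermined': 1}
```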

Clustering

Cluster Generation via bridge amplification DMLapato, CC BY-SA 4.0, via Wikimedia Commons

Sequencing-by-synthesis

  1. Add fluorescent nucleotides (complementary will bind)
  2. Wash away unbound nucleotides
  3. Fire laser & take photo
  4. Remove fluorophore
  5. Back to 1 ⤴️ [x50, x75 or x125 times, a.k.a. cycles]

Cluster Generation Abizar Lakdawalla , CC BY 3.0, via https://openlab.citytech.cuny.edu/

What does this look like?

Cluster Generation EMBL-EBI Training, CC BY-SA 4.0, via https://www.ebi.ac.uk/training/

Remember: this is happening millions of times at once!

Improving quality

  • Over time, imaging reagents get ‘tired’ and more errors occur
    • Sometimes no base binds, or several bind at once, so clusters become ‘desynced’
    • Base quality: the machine calculates the probability that it called the ‘right’ nucleotide in each photo
    • ‘Dead’ base call: typically reported as N
  • How to improve or correct?

Improving quality

  • Improvement: paired-end sequencing
    • Get the order of nucleotides by sequencing from one end
    • Get the reverse order of nucleotides by sequencing from the other end!
    • Bonus: sequence more of a molecule that is longer than the number of cycles
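
To see why paired-end sequencing covers more of a long molecule, here is a toy, error-free simulation. The 20-base molecule and the function names are invented for this example, and real reads would of course contain errors. The forward read is the first `cycles` bases; the reverse read is the first `cycles` bases of the opposite strand, i.e. the reverse complement of the molecule's other end.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(molecule, cycles):
    """Simulate error-free forward and reverse reads of `cycles` bases each."""
    forward = molecule[:cycles]
    reverse = reverse_complement(molecule)[:cycles]
    return forward, reverse


def bases_covered(molecule, cycles, paired=True):
    """Number of bases of the molecule seen by one read or by a read pair."""
    reads = 2 if paired else 1
    return min(len(molecule), reads * cycles)


molecule = "ACGTTGCAAGGCTTAACCGG"  # hypothetical 20-base molecule
forward, reverse = paired_end_reads(molecule, cycles=8)
print(forward)  # ACGTTGCA
print(reverse)  # CCGGTTAA
print(bases_covered(molecule, 8, paired=False))  # 8
print(bases_covered(molecule, 8, paired=True))   # 16
```

With 8 cycles, a single read sees 8 of the 20 bases, while the pair sees 16. For a molecule shorter than the number of cycles, the two reads overlap completely and can be compared against each other.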

MiSeq™, HiSeq™ 1000/1500/2000/2500 and NovaSeq™ 6000 v1.0 reagents paired-end flow cell, © 2021 Illumina, Inc. All rights reserved. Used here for training purposes only

FASTQ File

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity. - Wikipedia

FASTQ File

Example (files can be gigabytes in size!)

@K00233:37:HGHLYBBXX:3:1101:2646:1121 1:N:0:NACGCATC+NGCTAATG
NCGCATGAGCCGCCTGTATCAGGCGCTGATCGAACCGGGCATTGCAGTTGGGATAGATCGGAAGAGCACACGTCTG
+
#A7F<<AA<JFJFJJJJJJFFJJJJJJJAFFJFJJJJJJJFJAFFFJAJFJJ<FJJJJJFFF<FFA--FFFJJJJJ
@K00233:37:HGHLYBBXX:3:1101:4655:1121 1:N:0:NACGCATC+NGCTAATG
NATGCATGACAGGAGGTGAGGGCATTTTCCAGATTTTCAGGCTGCGACCTTGAGCATCTTTCGCCGCTTCCAGCAC
+
#AA-<FFFF7JFF7JJJJJFJJ<JJJJJA7FJJJJJJJFF<JFF<J7-<FJJJJFJFFJJJAAAAFFJJ--AJAJJ
@ <read id, e.g. machine ID, location on flowcell> <extra metadata>
  <DNA sequence; Note: N = base couldn't be called!>
+ <a separator>
  <base quality scores for each nucleotide in sequence>
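
The four-line layout above is simple enough to read with a few lines of Python. This is a minimal sketch that assumes every record is exactly four lines (no wrapped sequences) and uses names (`FastqRecord`, `read_fastq`) invented for this example. The record shown is the first one from the slide, truncated to 10 bases for brevity. For real work, an established parser is the safer choice.

```python
import io
from typing import NamedTuple


class FastqRecord(NamedTuple):
    read_id: str    # header line without the leading '@'
    sequence: str   # called bases (N = base could not be called)
    qualities: str  # one ASCII character per base


def read_fastq(handle):
    """Yield FastqRecord objects from a text handle, four lines at a time."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        if len(sequence) != len(qualities):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield FastqRecord(header[1:], sequence, qualities)


# First record from the slide, truncated to 10 bases for brevity.
example = (
    "@K00233:37:HGHLYBBXX:3:1101:2646:1121 1:N:0:NACGCATC+NGCTAATG\n"
    "NCGCATGAGC\n"
    "+\n"
    "#A7F<<AA<J\n"
)
records = list(read_fastq(io.StringIO(example)))
print(records[0].sequence)  # NCGCATGAGC
```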

Quality score

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
0.2......................26...31........41          
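
The table above maps each quality character to a score. In the encoding shown here ('!' = 0 up to 'J' = 41), the score is the character's ASCII code minus 33, and a Phred score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch, assuming this offset of 33 (older Illumina files used a different offset) and with function names invented for this example:

```python
def phred_scores(quality_string, offset=33):
    """Convert FASTQ quality characters to Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


print(phred_scores("!#;@J"))  # [0, 2, 26, 31, 41]
for score in (10, 20, 30, 40):
    print(score, error_probability(score))
```

So a score of 20 means roughly a 1 in 100 chance that the base call is wrong, and a score of 30 roughly 1 in 1000.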

Recap

  • DNA molecules essentially:
    • Made up of nucleotides (ACTG)
    • Two strands: complementary base pairs (C-G, A-T)
    • Modern DNA is long; aDNA is short
  • NGS Sequencing:
    • Massively multiplexed: millions of DNA molecules at once
    • Add adapters to bind to a glass slide (lawn)
    • Make a new strand, adding fluorescent nucleotides
    • Fire laser at each nucleotide and take photo
    • Desyncing of clusters results in lower base-quality scores over time
    • Improve by paired-end sequencing
  • Results in FASTQ file

Practical: Introduction to NGS data processing

Working on the command-line

What is the command line?

A command-line interface (CLI) processes commands to a computer program in the form of lines of text. - Wikipedia

  • i.e. you type words rather than pointing and clicking with a mouse
  • Important: more efficient/scalable & more reproducible
  • Most bioinformatics work is performed via command line
    • Often because you are working on remote servers (i.e. very large computers with no screen)

Logging into a server

  1. Open browser
  2. Go to your assigned IP address
  3. Log in with your assigned city username and specify a password
  4. New > Terminal

Remember this password: you won’t be asked to confirm it, and you will re-use it in Microbiome Data Analysis!

The command line

A command prompt (or just prompt) is a sequence of (one or more) characters used in a command-line interface to indicate readiness to accept commands. - Wikipedia

james_fellows_yates@bionc21:~$ 
<username>@<machine_name>:<current_directory>$
  • Everything after $ is where you type your command
  • Never copy and paste the prompt!
  • ⚠️Prompts look different on different machines!

Your first command

Type in everything after the prompt, then press enter/return (⏎) on your keyboard

$ echo "Hello world!"
Hello world!
  • A command typically consists of:
    1. Program/software/tool name
    2. Arguments (e.g. input files)
    3. Options or flags (e.g. -h or --help)

Move around

What is in the room (directory)?

$ ls

Let’s go into the input directory and see what’s in there!

$ cd input/
$ ls -l

How to go back?

$ cd ../

Your first bioinformatic job

  1. Check quality of sequencing/reads
  2. Remove adapters
  3. Merge paired-end reads
  4. Check quality of reads again

Your first bioinformatic job

We will run the nf-core/eager pipeline.

nf-core/eager is a scalable and reproducible bioinformatics best-practise processing pipeline for genomic NGS sequencing data, with a focus on ancient DNA (aDNA) data. It is ideal for the (palaeo)genomic analysis of humans, animals, plants, microbes and even microbiomes.

Pipeline (software): a chain of data-processing processes or other software entities

Your first bioinformatic job

Fellows Yates et al. (2021) PeerJ. DOI: 10.7717/peerj.10947

Run your first bioinformatic job

nextflow run nf-core/eager -profile singularity,test_tsv --input input/fastqs.tsv --fasta input/Mammoth_MT_Krause.fasta

What are we doing? - FastQC

What are we doing? - AdapterRemoval

Practical: Introduction to NGS data quality control

MultiQC Report

Fellows Yates et al. (2021) PeerJ. DOI: 10.7717/peerj.10947

GitHub copy: dag-material/<Intro to NGS>/assets/files/multiqc_report_testtsv_eager2_2_0.html